These files contain complete loan data for all loans issued from 2007 through 2015, including the current loan status ('Current', 'Late', 'Fully Paid', etc.) and the latest payment information. Additional features include credit scores, the number of finance inquiries, address information (zip code and state), and collections, among others. The file is a matrix of about 890 thousand observations and 75 variables. Here, we use a previously transformed data set, which is nevertheless a full copy of the original one. For more information, or if you want to download these data, consult:
In [1]:
# Required Libraries
from __future__ import print_function  # a __future__ import must precede all other statements
import os
import pandas as pd
import numpy as np
import folium
from pprint import pprint
#from time import time
np.set_printoptions(precision=4, suppress=True)
from matplotlib import pyplot as plt
import matplotlib as mlb
import seaborn as sns
In [2]:
# Matplotlib/Seaborn Parameters
mlb.style.use('seaborn-whitegrid')
sns.set_context('talk')
%matplotlib inline
mlb.rcParams['figure.titleweight'] = 'bold'
mlb.rcParams['axes.labelweight'] = 'bold'
mlb.rcParams['axes.titleweight'] = 'bold'
mlb.rcParams['figure.figsize'] = [10,6]
In [3]:
# User-defined Functions [Loaded from: /media/ML_HOME/ML-Code_Base (through a .pth file)]
from visualization_helper_functions import freq_tables
from ML_helper_functions import cv_estimate
In [4]:
# Path Definitions of Required Data Sets
loan_df_path = os.path.join('/media/ML_HOME/ML-Data_Repository/data', 'loan_df')
us_states_GeoJSON = os.path.join('/media/ML_HOME/ML-Data_Repository/maps', 'us_states-albersUSA-Geo.json')
The loan.csv file has been loaded and transformed using the lending_club_loan_data-eda.py script. The data types of all the attributes have been appropriately defined, as well as the categories of all the nominal and ordinal categorical variables. The transformed data set has ~890K observations and 75 attributes, and it is ~420MB in size. Note that some 'NA' values exist in specific float and str variables, but this is expected.
In [5]:
loan_df = pd.read_pickle(loan_df_path)
In [6]:
loan_df.shape
Out[6]:
In [7]:
loan_df.info()
In [8]:
loan_df.head()
Out[8]:
In [9]:
num_unique_values = len(pd.unique(loan_df['id']))
num_records = len(loan_df['id'])
is_key = (num_unique_values == num_records)
print('Checks the uniqueness of \'id\' [key]: %s' % is_key)
In [10]:
num_unique_values = len(pd.unique(loan_df['member_id']))
num_records = len(loan_df['member_id'])
is_key = (num_unique_values == num_records)
print('Checks the uniqueness of \'member_id\' [key]: %s' % is_key)
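The two checks above can also be written with pandas' `Series.is_unique`. A minimal sketch on a small hypothetical frame (the `toy_df` values and the helper name `is_candidate_key` are made up for illustration and are not taken from `loan_df`):

```python
import pandas as pd

# Hypothetical toy data: 'id' is unique, 'member_id' contains a duplicate.
toy_df = pd.DataFrame({'id': [101, 102, 103, 104],
                       'member_id': [9001, 9002, 9002, 9004]})

def is_candidate_key(df, column):
    """Return True if every value of `column` is unique and none is missing."""
    return bool(df[column].is_unique and df[column].notna().all())

for column in ['id', 'member_id']:
    print("Checks the uniqueness of '%s' [key]: %s"
          % (column, is_candidate_key(toy_df, column)))
```

Requiring the absence of missing values as well is a design choice of this sketch; the original check only compares the number of distinct values with the number of records.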
All date fields are valid:
'issue_d': The month in which the loan was funded ('Jun-2007' - 'Dec-2015').
'earliest_cr_line': The month the borrower's earliest reported credit line was opened ('Jan-1944' - 'Nov-2012').
'last_pymnt_d': Last month payment was received ('Dec-2007' - 'Jan-2016').
'next_pymnt_d': Next scheduled payment date ('Dec-2007' - 'Mar-2016').
'last_credit_pull_d': The most recent month LC pulled credit for this loan ('May-2007' - 'Jan-2016').
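One way such ranges could be read off is to take the minimum and maximum of every datetime column. The sketch below uses a hypothetical three-row frame, not the actual data:

```python
import pandas as pd

# Hypothetical toy data standing in for two of the date attributes.
toy_df = pd.DataFrame({
    'issue_d': pd.to_datetime(['2007-06-01', '2011-03-01', '2015-12-01']),
    'last_pymnt_d': pd.to_datetime(['2007-12-01', None, '2016-01-01']),
    'loan_amnt': [5000.0, 12000.0, 8000.0],
})

# Keep only the datetime columns; missing dates (NaT) are skipped by min/max.
dt_df = toy_df.select_dtypes(include=['datetime64'])
date_ranges = pd.DataFrame({'first': dt_df.min(), 'last': dt_df.max()})
print(date_ranges)
```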
In [11]:
loan_df.select_dtypes(include=['datetime64']).describe()
Out[11]:
In [12]:
dt_attribs = list(loan_df.select_dtypes(include=['datetime64']).columns)
dt_attribs
Out[12]:
In [13]:
for attrib in dt_attribs:
    pprint(loan_df.loc[:3,attrib])
    print('\n')
Some 'NA' values exist, but this is not a problem for text attributes. In particular, these text attributes are:
'emp_title': The job title supplied by the borrower when applying for the loan (Employer Title replaces Employer Name for all loans listed after 9/23/2013).
'url': URL for the LC page with listing data.
'desc': Loan description provided by the borrower.
'title': The loan title provided by the borrower.
We are not going to use them in this first stage of predictive modelling.
In [14]:
loan_df.select_dtypes(include=['object']).info()
In [15]:
str_attribs = list(loan_df.select_dtypes(include=['object']).columns)
str_attribs
Out[15]:
In [16]:
for attrib in str_attribs:
    pprint(loan_df.loc[:3, attrib])
    print('\n')
All float attributes are important for the predictive model we want to build. Some specific attributes, such as 'dti_joint' and 'annual_inc_joint', have a plethora of 'NA' values, but this is reasonable and expected.
'loan_amnt': The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
'funded_amnt': The total amount committed to that loan at that point in time.
'funded_amnt_inv': The total amount committed by investors for that loan at that point in time.
'int_rate': Interest Rate on the loan.
'installment': The monthly payment owed by the borrower if the loan originates.
'annual_inc': The self-reported annual income provided by the borrower during registration.
'dti': A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
'revol_bal': Total credit revolving balance.
'revol_util': Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
'out_prncp': Remaining outstanding principal for total amount funded.
'out_prncp_inv': Remaining outstanding principal for portion of total amount funded by investors.
'total_pymnt': Payments received to date for total amount funded.
'total_pymnt_inv': Payments received to date for portion of total amount funded by investors.
'total_rec_prncp': Principal received to date.
'total_rec_int': Interest received to date.
'total_rec_late_fee': Late fees received to date.
'recoveries': Post charge off gross recovery.
'collection_recovery_fee': Post charge off collection fee.
'last_pymnt_amnt': Last total payment amount received.
'annual_inc_joint': The combined self-reported annual income provided by the co-borrowers during registration.
'dti_joint': A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-reported monthly income.
'tot_cur_bal': Total current balance of all accounts.
'total_bal_il': Total current balance of all installment accounts.
'il_util': Ratio of total current balance to high credit/credit limit on all install acct.
'max_bal_bc': Maximum current balance owed on all revolving accounts.
'all_util': Balance to credit limit on all trades.
'total_rev_hi_lim': Total revolving high credit/credit limit.
In [17]:
loan_df.select_dtypes(include=['float']).info()
In [18]:
pprint(loan_df.select_dtypes(include=['float']).describe())
All integer attributes are important for the predictive model we want to build. Where 'NA' values exist, they have been set to '-9999'.
'delinq_2yrs': The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years.
'inq_last_6mths': Number of credit inquiries in past 6 months.
'mths_since_last_delinq': The number of months since the borrower's last delinquency.
'mths_since_last_record': The number of months since the last public record.
'open_acc': The number of open credit lines in the borrower's credit file.
'pub_rec': Number of derogatory public records.
'total_acc': The total number of credit lines currently in the borrower's credit file.
'collections_12_mths_ex_med': Number of collections in 12 months excluding medical collections.
'mths_since_last_major_derog': Months since most recent 90-day or worse rating.
'acc_now_delinq': The number of accounts on which the borrower is now delinquent.
'tot_coll_amt': Total collection amounts ever owed.
'open_acc_6m': Number of open trades in last 6 months.
'open_il_6m': Number of currently active installment trades.
'open_il_12m': Number of installment accounts opened in past 12 months.
'open_il_24m': Number of installment accounts opened in past 24 months.
'mths_since_rcnt_il': Months since most recent installment accounts opened.
'open_rv_12m': Number of revolving trades opened in past 12 months.
'open_rv_24m': Number of revolving trades opened in past 24 months.
'inq_fi': Number of personal finance inquiries.
'total_cu_tl': Number of finance trades.
'inq_last_12m': Number of credit inquiries in past 12 months.
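Because missing integers are encoded as -9999, summary statistics computed on the raw columns are pulled down by the sentinel. A minimal sketch, on hypothetical values, of masking the sentinel before summarising:

```python
import numpy as np
import pandas as pd

NA_SENTINEL = -9999  # the placeholder the text says was used for missing integers

# Hypothetical toy values for one integer attribute.
mths_since_last_delinq = pd.Series([12, NA_SENTINEL, 30, NA_SENTINEL, 6])

# Replace the sentinel with NaN so that pandas skips it in aggregations.
masked = mths_since_last_delinq.replace(NA_SENTINEL, np.nan)

print('Mean with sentinel:    %.1f' % mths_since_last_delinq.mean())
print('Mean without sentinel: %.1f' % masked.mean())
print('Missing values:        %d' % masked.isna().sum())
```

Note that masking turns the column into a float column, which is the usual trade-off when integer data has gaps.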
In [19]:
loan_df.select_dtypes(include=['int']).info()
In [20]:
pprint(loan_df.select_dtypes(include=['int']).iloc[:,2:]
.describe())
In [21]:
response = 'loan_status'
categ_df = loan_df.select_dtypes(include=['category'])
categ_attribs = list(categ_df.columns)
nominal_categ_attribs = []; ordinal_categ_attribs = []
for attrib in categ_attribs:
    if categ_df[attrib].cat.ordered:
        ordinal_categ_attribs.append(attrib)
    else:
        nominal_categ_attribs.append(attrib)
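The split relies on the `ordered` flag of each categorical column. The transformation script is not shown here, so the following is only a sketch of how such a flag might have been set, using pandas' `CategoricalDtype` on hypothetical values:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy columns; 'grade' is declared ordered with G lowest and A highest.
grade_dtype = CategoricalDtype(categories=list('GFEDCBA'), ordered=True)
toy_df = pd.DataFrame({
    'grade': pd.Series(['B', 'A', 'G', 'C']).astype(grade_dtype),
    'purpose': pd.Series(['car', 'wedding', 'car', 'vacation']).astype('category'),
})

ordinal, nominal = [], []
for attrib in toy_df.select_dtypes(include=['category']).columns:
    (ordinal if toy_df[attrib].cat.ordered else nominal).append(attrib)

print('Ordinal:', ordinal)
print('Nominal:', nominal)
print('Best grade in the toy data:', toy_df['grade'].max())
```

An ordered categorical supports `min`, `max` and comparisons, which is what makes the ordinal/nominal distinction useful later on.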
We chose to define the variables below as nominal categorical attributes:
'home_ownership': The home ownership status provided by the borrower during registration (Values: 'ANY', 'MORTGAGE', 'NONE', 'OTHER', 'OWN', 'RENT').
'pymnt_plan': Indicates if a payment plan has been put in place for the loan (Values: 'y'/'n').
'purpose': A category provided by the borrower for the loan request (Values: 'car', 'credit_card', 'debt_consolidation', 'educational', ..., 'renewable_energy', 'small_business', 'vacation', 'wedding').
'zip_code': The first 3 numbers of the zip code provided by the borrower in the loan application (Values: US Zip Codes).
'addr_state': The state provided by the borrower in the loan application (Values: US State Name).
'initial_list_status': The initial listing status of the loan (Values: 'w'/'f').
'policy_code': Publicly available products ('policy_code'='1'), new products not publicly available ('policy_code'='2').
'application_type': Indicates whether the loan is an individual application or a joint application with two co-borrowers (Values: 'INDIVIDUAL'/'JOINT').
Among them, these two cannot be important for the response variable outcome, 'loan_status':
'pymnt_plan' (only 10 'y' values; all the remaining ones are 'n')
'policy_code' (there are no products which are not publicly available, i.e. 'policy_code': '2')
Note that the 'application_type' attribute, which indicates whether a loan is an individual or a joint application, has only 511 'JOINT' applications, whereas all the remaining ones (886868!) have been made by individuals. However, we choose to keep this variable for the time being, and check below whether the response variable shows any difference between the two types of loans.
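How lopsided an attribute is can be summarised by the share of its most frequent level. A small sketch using the 'application_type' counts quoted above (the helper `dominant_share` is a hypothetical name introduced here):

```python
import pandas as pd

def dominant_share(counts):
    """Share of records that fall into the most frequent level."""
    counts = pd.Series(counts)
    return counts.max() / float(counts.sum())

# Value counts quoted in the text for 'application_type'.
application_type_counts = {'INDIVIDUAL': 886868, 'JOINT': 511}
share = dominant_share(application_type_counts)
print('Dominant level share: %.4f%%' % (share * 100))
```

A share this close to 100% suggests the attribute carries very little information on its own, which is why it is examined against the response before any decision is made.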
In [22]:
pprint(nominal_categ_attribs)
In [23]:
categ_df[nominal_categ_attribs].info()
In [24]:
for attrib in nominal_categ_attribs:
    s = 'Nominal Categorical Attribute: "%s":' % attrib
    print(s)
    print('-' * (len(s)+3))
    pprint(categ_df.loc[:3, attrib])
    s = '\nValue Counts:'
    print(s)
    print('-' * (len(s)+3))
    pprint(pd.value_counts(categ_df[attrib]))
    print('\n')
We chose to define as ordinal categorical attributes the variables below:
'term'
: The number of payments on the loan
Categories (2, object): [36 months < 60 months]
'grade'
: LC assigned loan grade.
Categories (7, object): [G < F < E < D < C < B < A]
'sub_grade'
: LC assigned loan subgrade.
Categories (35, object): [G5 < G4 < G3 < G2 ... A4 < A3 < A2 < A1]
'emp_length'
: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
Categories (12, object): [n/a < < 1yr < 1yr < 2yrs ... 7yrs < 8yrs < 9yrs < 10+yrs]
'verification_status'
: Indicates if the borrowers' income was verified by LC, not verified, or if the income source was verified.
Categories (3, object): [Not Verified < Source Verified < Verified]
'verification_status_joint'
: Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified.
Categories (4, object): [Not Known < Not Verified < Source Verified < Verified]
Among them, the 'verification_status_joint' variable has a plethora of 'Not Known' values, but we choose not to exclude it from our consideration, at least for the time being. This field keeps the verification status of the co-borrowers' joint income, and can provide useful information for the 'JOINT' loan applications. Note also that the sum of the value counts of its remaining, informative values, i.e. 'Not Verified', 'Source Verified' and 'Verified', equals the number of the recorded 'JOINT' loan applications, i.e. 511:
Nominal Categorical Attribute: "application_type":
-----------------------------------------------------
0 INDIVIDUAL
1 INDIVIDUAL
Name: application_type, dtype: category
Categories (2, object): [INDIVIDUAL, JOINT]
Value Counts:
-----------------
INDIVIDUAL 886868
JOINT 511
Name: application_type, dtype: int64
Ordinal Categorical Attribute: "verification_status_joint":
--------------------------------------------------------------
0 Not Known
1 Not Known
Name: verification_status_joint, dtype: category
Categories (4, object): [Not Known < Not Verified < Source Verified < Verified]
Value Counts:
-----------------
Not Known 886868
Not Verified 283
Verified 167
Source Verified 61
Name: verification_status_joint, dtype: int64
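The consistency noted above can be checked directly from the two value-count tables. A minimal sketch, with the counts copied from the output shown:

```python
import pandas as pd

# Counts copied from the output above.
application_type_counts = pd.Series({'INDIVIDUAL': 886868, 'JOINT': 511})
verification_joint_counts = pd.Series({'Not Known': 886868,
                                       'Not Verified': 283,
                                       'Verified': 167,
                                       'Source Verified': 61})

# Sum every level except the uninformative 'Not Known'.
informative_total = verification_joint_counts.drop('Not Known').sum()
print('Informative joint verification records: %d' % informative_total)
print('Matches JOINT applications: %s'
      % (informative_total == application_type_counts['JOINT']))
```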
In [25]:
pprint(ordinal_categ_attribs)
In [26]:
categ_df[ordinal_categ_attribs].info()
In [27]:
for attrib in ordinal_categ_attribs:
    s = 'Ordinal Categorical Attribute: "%s":' % attrib
    print(s)
    print('-' * (len(s)+3))
    pprint(categ_df.loc[:3, attrib])
    s = '\nValue Counts:'
    print(s)
    print('-' * (len(s)+3))
    pprint(pd.value_counts(categ_df[attrib]))
    print('\n')
Response Variable ('loan_status')
Possible values of the 'loan_status' attribute:
'Default' ['0A']: Loans that have defaulted.
'Does not meet the credit policy. Status:Charged Off' (Written-off, ['0B']): Loans that do not meet the LC credit policy and have been written off.
'Charged Off' (Written-off, ['0C']): The loan balance has been reduced to zero by recognizing the recorded value as an expense.
'Late (31-120 days)' (Red Loans, ['0D']): Loans whose installment payments have been delayed for 31-120 days.
'Late (16-30 days)' (Yellow Loans, ['0E']): Loans whose installment payments have been delayed for 16-30 days.
'Issued' ['1A']: Loans that have just been issued.
'In Grace Period' ['1B']: Loans that have been issued and are currently in their grace period.
'Current' ['1C']: Loans that are currently active and serviced.
'Does not meet the credit policy. Status:Fully Paid' ['1D']: Loans that do not meet the LC credit policy, but have been fully paid.
'Fully Paid' ['1E']: Loans that have been fully paid.
In [28]:
s = 'Response Variable: "%s":' % response  # use 'response', not the leftover loop variable 'attrib'
print(s)
print('-' * (len(s)+3))
pprint(loan_df.loc[:3, response])
s = '\nValue Counts:'
print(s)
print('-' * (len(s)+3))
pprint(pd.value_counts(loan_df[response]))
print('\n')
In [29]:
# Group the loan classes in "Bad Loans" & "Good Loans"
bad_loans = ['Default', # ['0A']
'Does not meet the credit policy. Status:Charged Off', # ['0B']
'Charged Off', # ['0C']
'Late (31-120 days)', # ['0D']
'Late (16-30 days)'] # ['0E']
good_loans = ['Issued', # ['1A']
'In Grace Period', # ['1B']
'Current', # ['1C']
'Does not meet the credit policy. Status:Fully Paid', # ['1D']
'Fully Paid'] # ['1E']
# Denote the "Bad Loans" ("Good Loans") with the "0" ("1") Flag
# (assign the result back, instead of replacing in place on a '.loc' selection)
loan_df[response] = loan_df[response].replace(to_replace=bad_loans, value='0')
loan_df[response] = loan_df[response].replace(to_replace=good_loans, value='1')
In [30]:
# BAD LOANS
num_bad_loans = len(loan_df.loc[loan_df[response] == '0', response])
print('Bad Loans: %d' % num_bad_loans)
# GOOD LOANS
num_good_loans = len(loan_df.loc[loan_df[response] == '1', response])
print('Good Loans: %d' % num_good_loans)
# BAD to GOOD LOANS RATIO
# (float() guards against integer division under Python 2; round only when printing)
bad_to_good_loans_ratio = float(num_bad_loans) / num_good_loans
print('Bad to Good Loans percentage: %.2f%%' % (bad_to_good_loans_ratio * 100))
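A complementary way to look at the imbalance is the share of each class rather than the bad-to-good ratio. A sketch on a hypothetical flag column (the values are made up and do not reflect the actual counts):

```python
import pandas as pd

# Hypothetical toy response: '0' = bad loan, '1' = good loan.
toy_response = pd.Series(['1', '1', '0', '1', '1', '1', '0', '1', '1', '1'])

# Share of each class, ordered by flag.
class_shares = toy_response.value_counts(normalize=True).sort_index()

# Ratio of bad to good loans, kept unrounded until it is printed.
bad_to_good = float((toy_response == '0').sum()) / (toy_response == '1').sum()

print(class_shares)
print('Bad to good ratio: %.2f%%' % (bad_to_good * 100))
```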
In [31]:
# Countplot of Loan Statuses
sns.countplot(x=response, data=loan_df)
plt.title('Countplot of Loan Statuses\n[\'loan_df\']')
plt.xticks(range(2), ['Bad Loans', 'Good Loans'], fontweight='bold')
plt.xlabel('Loan Status')
plt.ylabel('No. of Records')
plt.show()
In [32]:
freq_tables(loan_df, ['application_type'], response,
barplot=True, matplotlib_style='seaborn-whitegrid')
=> 'verification_status_joint': Also Not Important!
In [33]:
individual_loans_ix = (loan_df['application_type'] != 'JOINT')
for attrib in ['application_type', 'verification_status', 'verification_status_joint']:
    s = 'Attribute: "%s":' % attrib
    print(s)
    print('-' * (len(s)+3))
    pprint(loan_df.loc[individual_loans_ix, attrib].head(n=3))
    s = '\nValue Counts:'
    print(s)
    print('-' * (len(s)+3))
    pprint(pd.value_counts(loan_df.loc[individual_loans_ix, attrib]))
    print('\n')
"loan_df"
DataFrame
According to the previous discussion, the attributes below are not important for the response variable outcome, and we choose to remove them from the loan_df data set, to enhance the predictive performance of our model.
'policy_code': Publicly available (policy_code=1), new products not publicly available (policy_code=2).
'pymnt_plan': Indicates if a payment plan has been put in place for the loan.
'application_type': Indicates whether the loan is an individual application or a joint application with two co-borrowers.
'verification_status_joint': Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified.
Furthermore, keeping the records of the 'JOINT' loan applications is now meaningless, and we drop them. The same is true for both the attributes 'annual_inc_joint' and 'dti_joint', which now have 'NA' values only. We also drop them.
Finally, we are not going to take into consideration the few text attributes that exist in loan_df, at least in this first stage of predictive modelling. These text attributes, which have been stored in the str_attribs list above, are:
'desc': Loan description provided by the borrower.
'emp_title': The job title supplied by the borrower when applying for the loan (Note: Employer Title replaces Employer Name for all loans listed after 9/23/2013).
'title': The loan title provided by the borrower.
'url': URL for the LC page with listing data.
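Before dropping the joint-income attributes, one could confirm that they really are empty once the 'JOINT' rows are gone. A sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the 'JOINT' row carries joint-income values.
toy_df = pd.DataFrame({
    'application_type': ['INDIVIDUAL', 'JOINT', 'INDIVIDUAL', 'INDIVIDUAL'],
    'annual_inc_joint': [np.nan, 95000.0, np.nan, np.nan],
    'dti_joint': [np.nan, 18.5, np.nan, np.nan],
    'loan_amnt': [5000.0, 20000.0, 7500.0, 12000.0],
})

# Keep the individual applications only, as an independent copy.
individual = toy_df.loc[toy_df['application_type'] != 'JOINT'].copy()

# Columns that hold nothing but missing values after the filter.
all_na_cols = [col for col in individual.columns if individual[col].isna().all()]
print('All-NA columns:', all_na_cols)

individual = individual.drop(columns=all_na_cols + ['application_type'])
print(individual.columns.tolist())
```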
In [34]:
loan_df = loan_df.loc[individual_loans_ix,:]
In [35]:
loan_df.shape
Out[35]:
In [36]:
unimp_attribs = ['policy_code',
                 'pymnt_plan',
                 'application_type',
                 'verification_status_joint',
                 'annual_inc_joint',
                 'dti_joint']
for attrib in unimp_attribs:
    del loan_df[attrib]
In [37]:
loan_df.shape
Out[37]:
Consequently, the attributes that we are going to consider further are listed below.
In [38]:
imp_attribs0 = list(loan_df.head()
.select_dtypes(exclude=['object']).columns)
In [39]:
loan_df.loc[:,imp_attribs0].info()
In [40]:
sns.distplot(loan_df.loc[:, 'loan_amnt'], hist=True, kde=True, rug=False,
norm_hist=True, axlabel='Loan Amount')
plt.title('Loan Amount Distribution')
plt.show()
In [41]:
grouped = loan_df.groupby(by=['issue_d'])
grouped_agg = (grouped['loan_amnt']
.agg(np.sum)
.rename('loanbook_amnt'))
grouped_agg_df = grouped_agg.reset_index()
grouped_agg_ts = pd.Series(data=grouped_agg_df['loanbook_amnt'].values,
index=grouped_agg_df['issue_d'])
del grouped_agg_df
fig = plt.figure(figsize=(13, 8))
ax = fig.add_subplot(111)
grouped_agg_ts.plot(ax=ax)
ax.set_xlabel('Date Issued [\'issue_d\']')
ax.set_ylabel('Loan Book Amount')
ax.set_title('Loan Book Growth\n[\'loan_df\']')
plt.show()
In [42]:
s = '\nBad Loans [Flagged as \'0\']:'
print(s)
print('-'*(len(s) + 3))
pprint(bad_loans)
s = '\nGood Loans [Flagged as \'1\']:'
print(s)
print('-'*(len(s) + 3))
pprint(good_loans)
In [43]:
# Countplot of Loan Statuses
sns.countplot(x=response, data=loan_df)
plt.title('Countplot of Loan Statuses\n[\'loan_df\']')
plt.xticks(range(2), ['Bad Loans', 'Good Loans'], fontweight='bold')
plt.xlabel('Loan Status')
plt.ylabel('No. of Records')
plt.show()
In [44]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(13,8), sharey=True)
sns.boxplot(x='loan_status', y='loan_amnt', data=loan_df, ax=ax1)
ax1.set_xticks(range(2))
ax1.set_xticklabels(['Bad Loans', 'Good Loans'], fontweight='bold')
ax1.set_xlabel('Loan Status')
ax1.set_ylabel('Loan Amount')
sns.boxplot(x='loan_status', y='loan_amnt', hue='grade', data=loan_df, ax=ax2)
ax2.set_xticks(range(2))
ax2.set_xticklabels(['Bad Loans', 'Good Loans'], fontweight='bold')
ax2.legend(loc='upper center', framealpha=0.5, prop={'weight': 'bold'})
ax2.legend_.set_title('Grade', prop={'weight': 'bold', 'size': 12})
ax2.set_xlabel('Loan Status')
ax2.set_ylabel('Loan Amount')
plt.suptitle('"Loan Amount" Distribution per "Loan Class" & "Grade"\n[\'loan_df\']',
fontsize=16, fontweight='bold')
plt.tight_layout(h_pad=2, w_pad=2, rect=[0,0,0.93,0.93])
plt.show()
In [45]:
fig = plt.figure(figsize=(13, 8))
ax = fig.add_subplot(111)
grouped = loan_df.groupby(by=['issue_d', 'grade'])
grouped_agg = (grouped['loan_amnt']
.agg(np.sum)
.rename('loanbook_amnt_per_grade'))
grouped_agg_df = grouped_agg.reset_index()
grouped_agg_df1 = grouped_agg_df.pivot_table(values='loanbook_amnt_per_grade',
index = 'issue_d',
columns=['grade'])
grouped_agg_df1.plot(ax=ax)
ax.set_title('Loan Book Growth per "Loan Grade"\n[\'loan_df\']')
ax.legend(loc='best', prop={'weight': 'bold'})
ax.legend_.set_title('Grade', prop={'weight': 'bold', 'size': 12})
ax.set_xlabel('Date Issued [\'issue_d\']')
ax.set_ylabel('Loan Book Amount')
plt.show()
Here, we provide two choropleth maps of the Loan Book Value and the Loan Book Volume distribution across the U.S. States. To do so, we have used the "Bokeh" Python library; a GeoJSON file which defines the U.S. States boundaries and has been produced from a cartographic boundary shapefile provided by the official site of the U.S. Census Bureau; and the Pandas DataFrame grouped_agg_df, where we aggregate the number and the value of loans per U.S. State. "Bokeh" is a Python library for interactive visualizations in the browser; it renders through its own BokehJS client library rather than through D3.
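The GeoJSON file loaded below is described as already carrying the loan book measures. How that enrichment was done is not shown; one possible approach, sketched here on a hypothetical two-feature GeoJSON, is to copy the per-state aggregates into each feature's properties. The 'state' property name is taken from the hover tooltip used later; that it holds the same codes as 'addr_state' is an assumption, as is the helper name `enrich_geojson`.

```python
import json

def enrich_geojson(geojson, measures, key='state'):
    """Copy per-state measures into the properties of the matching features."""
    for feature in geojson['features']:
        state = feature['properties'].get(key)
        # States without loans keep their original properties untouched.
        feature['properties'].update(measures.get(state, {}))
    return geojson

# Hypothetical, heavily simplified GeoJSON with made-up geometries and values.
toy_geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'state': 'CA'},
         'geometry': {'type': 'Polygon',
                      'coordinates': [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
        {'type': 'Feature', 'properties': {'state': 'NY'},
         'geometry': {'type': 'Polygon',
                      'coordinates': [[[2, 2], [3, 2], [3, 3], [2, 2]]]}},
    ],
}
toy_measures = {'CA': {'loanbook_amnt_per_state': 1500.0,
                       'loanbook_vol_per_state': 2}}

enriched = enrich_geojson(toy_geojson, toy_measures)
# Serialised text of this kind is what a GeoJSON data source reads.
geojson_text = json.dumps(enriched)
```

Features left without a measure would have no value to map to a color, so in practice one might prefer to fill them with an explicit placeholder.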
In [46]:
# Compute the "Loan Book Amount & Volume" per "US State"
grouped = loan_df.groupby(by=['addr_state'])
grouped_agg = (grouped[['loan_amnt']].agg(np.sum)
.rename(columns={'loan_amnt': 'loanbook_amnt_per_state'}))
grouped_agg['loanbook_vol_per_state'] = grouped['loan_amnt'].agg(np.count_nonzero)
grouped_agg_df = grouped_agg.reset_index()
grouped_agg_df.head()
Out[46]:
In [47]:
# Load the necessary libraries for the Bokeh Visualization
from bokeh.io import show, output_notebook
from bokeh.palettes import (
YlOrRd9 as palette1,
YlGnBu9 as palette2)
from bokeh.plotting import figure
from bokeh.models import (
GeoJSONDataSource,
LogColorMapper,
HoverTool,
LogTicker,
ColorBar)
# Load the enriched GeoJSON Data Source, with the loanbook measures of interest
with open(us_states_GeoJSON, 'r') as f:
geo_source = GeoJSONDataSource(geojson=f.read())
# Output the Choropleth Plots in Notebook
output_notebook()
# PROVIDE THE CHOROPLETH OF "LOAN BOOK AMOUNT PER STATE"
# Use a reversed copy, so that the shared Bokeh palette is not mutated on every run
color_mapper = LogColorMapper(palette=list(palette1)[::-1],
low=grouped_agg_df['loanbook_amnt_per_state'].min(),
high=grouped_agg_df['loanbook_amnt_per_state'].max())
# Define the figure "Tools" we want to make available
TOOLS = "pan, wheel_zoom, reset, hover, save"
# Plot the figure
# Define the figure dimensions and its general details
p = figure(title="Loan Book Value by U.S. States", tools=TOOLS,
plot_width=960, plot_height=500,
x_range=(0, 960), y_range=(500, 0),
x_axis_location=None, y_axis_location=None)
# Render the "Bokeh" patches in Glyph
p.patches('xs', 'ys', source=geo_source,
fill_color={'field': "loanbook_amnt_per_state" ,'transform': color_mapper},
fill_alpha=0.7, line_color="white", line_width=0.5)
# Add a Hover Tools over the US States
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
("State", "@state"),
("Loan Book Amount", "@loanbook_amnt_per_state{,.2f} USD"),
("(Long, Lat)", "($x, $y)"),
]
# Add a ColorBar Legend
color_bar = ColorBar(color_mapper=color_mapper, ticker=LogTicker(),
background_fill_alpha=0.7,
label_standoff=5,
major_label_text_color='black',
major_tick_line_color='black', major_tick_line_width=1.3, major_tick_out=5,
border_line_color=None, location=(0,0),
orientation='horizontal', width=500)
p.add_layout(color_bar, 'above')
show(p)
In [48]:
# PROVIDE THE CHOROPLETH OF "LOAN BOOK VOLUME PER STATE"
# Use a reversed copy, so that the shared Bokeh palette is not mutated on every run
color_mapper = LogColorMapper(palette=list(palette2)[::-1],
low=grouped_agg_df['loanbook_vol_per_state'].min(),
high=grouped_agg_df['loanbook_vol_per_state'].max())
# Define the figure "Tools" we want to make available
TOOLS = "pan, wheel_zoom, reset, hover, save"
# Plot the figure
# Define the figure dimensions and its general details
p = figure(title="Loan Book Volume by U.S. States", tools=TOOLS,
plot_width=960, plot_height=500,
x_range=(0, 960), y_range=(500, 0),
x_axis_location=None, y_axis_location=None)
# Render the "Bokeh" patches in Glyph
p.patches('xs', 'ys', source=geo_source,
fill_color={'field': "loanbook_vol_per_state" ,'transform': color_mapper},
fill_alpha=0.7, line_color="white", line_width=0.5)
# Add a Hover Tools over the US States
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
("State", "@state"),
("Loan Book Volume", "@loanbook_vol_per_state{,}"),
("(Long, Lat)", "($x, $y)"),
]
# Add a ColorBar Legend
color_bar = ColorBar(color_mapper=color_mapper, ticker=LogTicker(),
background_fill_alpha=0.7,
label_standoff=5,
major_label_text_color='black',
major_tick_line_color='black', major_tick_line_width=1.3, major_tick_out=5,
border_line_color=None, location=(0,0),
orientation='horizontal', width=500)
p.add_layout(color_bar, 'above')
show(p)
In [49]:
freq_tables(loan_df, ['purpose'], response,
barplot=True, matplotlib_style='seaborn-whitegrid')
In [50]:
freq_tables(loan_df, ['grade'], response,
barplot=True, matplotlib_style='seaborn-whitegrid')
In [51]:
sns.boxplot(x='grade', y='int_rate', data=loan_df)
plt.xlabel('Loan Grade')
plt.ylabel('Interest Rate')
plt.title('"Interest Rate" by "Loan Grade"\n[\'loan_df\']',
fontsize=16, fontweight='bold')
plt.xticks(range(7), list(loan_df['grade'].cat.categories),
fontweight='bold')
plt.show()